Dataset relevance estimation in storage systems

ABSTRACT

The invention is notably directed to computer-implemented methods and systems for managing datasets in a storage system. In such systems, it is assumed that a (typically small) subset of datasets are labeled with respect to their relevance, so as to be associated with respective relevance values. Essentially, the present methods determine, for each unlabeled dataset of the datasets, a respective probability distribution over a set of relevance values. From this probability distribution, a corresponding relevance value can be obtained. This probability distribution is computed based on distances (or similarities), in terms of metadata values, between said each unlabeled dataset and the labeled datasets. Based on their associated relevance values, datasets can then be efficiently managed in a storage system.

BACKGROUND

The invention relates in general to the field of computer-implemented methods and systems for managing datasets (e.g., files) in a storage system. In particular, it is directed to methods and systems for managing datasets across storage tiers of a storage system, based on their relevance.

Multi-tiered storage systems may comprise several tiers of storage. Such systems typically assign different categories of data to various types of storage media, in order to reduce the global storage cost, while maintaining performance. A tiered storage system usually relies on policies that assign most frequently accessed data to high-performance storage tiers, whereas rarely accessed data are stored on low-performance (cheaper, and/or slower) storage tiers.

SUMMARY

The invention is notably directed to computer-implemented methods and systems for managing datasets in a storage system. In such systems, it is assumed that a (typically small) subset of datasets are labeled with respect to their relevance, so as to be associated with respective relevance values. Essentially, the present methods determine, for each unlabeled dataset of the datasets, a respective probability distribution over a set of relevance values. From this probability distribution, a corresponding relevance value can be obtained. This probability distribution is computed based on distances (or similarities), in terms of metadata values, between said each unlabeled dataset and the labeled datasets. Based on their associated relevance values, datasets can then be efficiently managed in a storage system.

BRIEF DESCRIPTION OF THE DRAWINGS

The following detailed description, given by way of example and not intended to limit the invention solely thereto, will best be appreciated in conjunction with the accompanying drawings.

The accompanying drawings show simplified representations of devices or parts thereof, as involved in embodiments. Technical features depicted in the drawings are not necessarily to scale. Similar or functionally similar elements in the figures have been allocated the same numeral references, unless otherwise indicated.

FIG. 1 is a flowchart illustrating high-level steps of a method for managing datasets in a storage system, as in embodiments.

FIG. 2 is a diagram schematically illustrating labeled and unlabeled datasets, wherein the datasets are associated with supersets, as involved in embodiments.

FIG. 3 is another diagram schematically illustrating a message-passing algorithm, where messages are passed along edges of a heterogeneous bipartite graph, as involved in embodiments.

FIG. 4 schematically represents a storage system, comprising a relevance determination unit and a management unit, suited for implementing method steps as involved in embodiments of the invention.

FIG. 5 schematically represents a general purpose computerized system, suited for implementing one or more method steps as involved in embodiments.

FIG. 6 depicts a cloud computing environment, in accordance with an embodiment of the present invention.

FIG. 7 depicts abstraction model layers, in accordance with an embodiment of the present invention.

DETAILED DESCRIPTION

Consider for example a storage system wherein applications are run on large batches of datasets (e.g., astronomical data repositories, financial transaction logs, medical data repositories). Data that have not been accessed for long periods of time (also called "cold data") are stored on cheaper (energy efficient) media such as tapes. However, accessing data from such media is also slower, and this implies a substantial drop in performance of applications running on data stored in these media.

According to a first aspect, the present invention is embodied as computer-implemented methods for managing datasets in a storage system. In this system, it is assumed that a (typically small) subset of datasets are labeled with respect to their relevance, so as to be associated with respective relevance values. Essentially, the present methods determine, for each unlabeled dataset of the datasets, a respective probability distribution over a set of relevance values. From this probability distribution, a corresponding relevance value can be obtained. This probability distribution is computed based on distances (or similarities), in terms of metadata values, between said each unlabeled dataset and the labeled datasets. Based on their associated relevance values, datasets can then be efficiently managed in the storage system.

The relevance (or value) of data is a metric associated with datasets (e.g., files) that represents the importance of such datasets to a user. In a cognitive storage system, the relevance of a file can be used to determine its storage policies, to reduce storage costs while retaining reliability and performance for the sets of files. The relevance of a file can be estimated by obtaining samples of important and unimportant files from the user and applying a supervised learning algorithm to estimate the relevance metric for other files, using the file metadata as features.

In the above scheme, at least distances (or similarities) between each unlabeled dataset and labeled datasets are taken into account. However, the algorithm may in fact consider distances between each unlabeled dataset and any other datasets in the storage system (including datasets not labeled yet). For example, the above probability distribution can notably be computed based on a sum of weighted, initial probability distributions associated with other datasets of the system, i.e., including unlabeled datasets. Initial probability distributions of given datasets are determined as a probability distribution over said relevance values, according to relevance values associated with the given datasets. The sum is typically weighted according to inverse distances between said each unlabeled dataset and said other datasets.

Preferably, supersets of labeled datasets are defined, based on metadata available for the datasets. Thus, each dataset can be associated to at least one of the supersets defined, by comparing metadata available for said each dataset with metadata used to define said at least one of the supersets. In that case, the needed probability distributions (as associated with unlabeled datasets) can be determined based on a sum of weighted, initial probability distributions associated with other datasets of the superset(s) with which said each unlabeled dataset is associated. The needed probability distributions can efficiently be computed thanks to a message-passing algorithm on a heterogeneous bipartite graph.

According to another aspect, the invention can be embodied as a storage system, storing datasets. Again, only a subset of the datasets is assumed to be labeled with respect to their relevance. The system may include a relevance determination unit, configured to determine probability distributions associated with unlabeled datasets and, in turn, corresponding relevance values, following principles discussed above. The system may further include a management policy unit, the latter configured to manage datasets in the storage system based on the relevance values accordingly determined, in operation. The management policy unit may for instance be configured to store datasets across storage tiers of the storage system, according to a storage management policy that takes relevance values of the datasets as input.

According to a final aspect, the invention is embodied as a computer program product for managing datasets in a storage system. The computer program product may include a computer readable storage medium having program instructions embodied therewith, wherein the program instructions are executable by a computerized system to cause it to take steps according to the present methods.

Computerized devices, systems, methods, and computer program products embodying the present invention will now be described, by way of non-limiting examples, and in reference to the accompanying drawings.

As the present Inventors have realized, the known supervised learning algorithms (e.g., naïve Bayes or Information Bottleneck) require a large number of training samples to achieve acceptable estimation accuracy. To address this, they have developed new learning methods that can achieve high estimation accuracy while requiring substantially fewer training samples.

In reference to FIGS. 1-4, an aspect of the invention is first described, which concerns a computer-implemented method for managing datasets 51, 52 in a storage system 1. A subset 51 of the datasets 51, 52 is assumed to have been labeled with respect to their relevance. The datasets 51 are thus associated with respective relevance values, which represent the importance or value of such datasets 51 to a user, or an application.

Essentially, the present methods revolve around determining (step S20, FIG. 1), for each unlabeled dataset 52 of the datasets 51, 52, a respective probability distribution over a set of relevance values. This, in turn, makes it possible to obtain a corresponding relevance value for each dataset 52 that is initially not labeled. The needed probability distribution is computed based on distances, in terms of metadata values, between the unlabeled dataset 52 and the labeled datasets 51. Said distances could as well be regarded as similarities between the datasets. I.e., the more similar the datasets, the closer they are and the shorter the distance between them. This assumes that suitable distance metrics are available, thanks to which distances between the datasets can be computed, as exemplified later.

As a result, all the datasets 51, 52 as stored on the storage system 1 can be associated with relevance values, based on which the datasets can be managed S30 in the storage system 1. The relevance of the datasets can indeed advantageously be relied on to efficiently manage the datasets, e.g., to store, duplicate (to reach a given level of redundancy), refresh, and/or garbage collect the datasets, etc., according to their relevance. That is, the dataset management policy shall depend on the relevance values of the datasets 51, 52.

Interestingly here, the relevance of the initially unlabeled datasets 52 is inferred from a (reduced) training set of already labeled datasets 51. This is achieved by leveraging probability distribution functions associated with the already labeled datasets 51. From these initial distribution functions, distribution functions can be inferred for the unlabeled datasets 52 too. From there, relevance values can be obtained; the obtained values are then used to manage the datasets in the storage system.

"Unlabeled datasets" 52 are datasets whose relevance has not initially been rated by a user or an application, contrary to labeled datasets 51, which can be regarded as a training set. Now, when running the above process, the system extrapolates relevance values for the initially unlabeled datasets 52 from relevance values of the initially labeled datasets 51, so as to rate the unlabeled datasets 52. On completion of the process, the initially unlabeled datasets 52 are associated with relevance values and thus can be regarded as "labeled" by the system. Still, the datasets of the training set, as initially rated by a user/an application, should be distinguished from the datasets 52 as eventually labeled by the system, as the datasets 52 are automatically and cognitively rated, at a subsequent stage of the process. Typically, only a small or very small (e.g., less than 1% or 0.1%) fraction of the total datasets need initially be rated, prior to implementing the present methods. This represents a substantial improvement over known supervised learning techniques, which typically require larger training sets.

The present algorithms may in fact consider distances between each unlabeled dataset and any other datasets, these possibly including unlabeled datasets. In that respect, the needed probability distributions are preferably computed based on a sum of weighted, initial probability distributions associated with other datasets of the system (i.e., any type of datasets, possibly including unlabeled datasets). As explained later in detail, an initial probability distribution of a given dataset can be determined S10 as a probability distribution over a set of relevance values, based on a known relevance value associated with that given dataset.

For example, the above sum can be computed as a sum of weighted, initial probability distributions associated with labeled datasets of the system, where initial probability distributions can easily be extrapolated from known relevance values as initially associated with the labeled datasets 51.

Yet, sophisticated algorithms, e.g., relying on message-passing, can be involved to obtain massive sets of probability distributions associated with unlabeled datasets 52, whereby a plurality of datasets 52 can altogether get rated, based on a restricted set of labeled datasets 51, as explained later in detail. Such algorithms will typically involve distances (or similarities), in terms of metadata values, between, on the one hand, each of the plurality of datasets 52 and, on the other hand, other datasets, these including other, unlabeled datasets 52, i.e., not only the labeled datasets 51.

For instance, the sum of initial probability distributions may be weighted according to inverse distances between, on the one hand, the unlabeled dataset 52 (whose probability distribution in terms of relevance value is sought) and, on the other hand, the other datasets considered. Suitable distance metrics can be relied on, as discussed later. Hence, the closer the datasets considered are to the unlabeled dataset of interest, the more impact they have on the sought probability distribution.

The required weights may for instance be taken as any reasonable function (e.g., an inverse power) of the distance. Preferably, the weights are equal to the inverse of the distance between the datasets.
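By way of illustration only, the following minimal Python sketch (assuming NumPy; the function and variable names are illustrative, not part of the invention) combines the initial distributions of other datasets into the sought distribution, using inverse-distance weights:

```python
import numpy as np

def weighted_distribution(dist_to_others, initial_distributions):
    """Combine the initial relevance distributions of other datasets,
    weighting each by the inverse of its distance to the unlabeled
    dataset, and normalize the result to a probability distribution."""
    weights = 1.0 / np.asarray(dist_to_others)            # inverse-distance weights
    combined = weights @ np.asarray(initial_distributions)
    return combined / combined.sum()                      # normalize to sum to 1

# Hypothetical example: three other datasets, four relevance classes.
dists = [0.5, 1.0, 2.0]
P = [[1.0, 0.0, 0.0, 0.0],             # labeled: delta on the first class
     [0.0, 0.0, 1.0, 0.0],             # labeled: delta on the third class
     [0.25, 0.25, 0.25, 0.25]]         # unlabeled: uniform
print(weighted_distribution(dists, P))
```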

Referring now more particularly to FIGS. 2, 3, in embodiments, the present methods further involve supersets 61-63 of labeled datasets 51. Such supersets are defined prior to determining probability distributions associated with the unlabeled datasets 52. The supersets can for instance be defined based on metadata (e.g., metadata fields) available for the datasets 51, 52, as explained later in detail. Then, as illustrated in FIG. 2, each dataset 51, 52 can be associated to one or more of the supersets 61-63 so defined. This can be achieved by comparing metadata (e.g., metadata fields) available for the datasets 51, 52 with the metadata (e.g., metadata fields) used to define each of the supersets 61-63.

In that case, the probability distributions sought (which pertain to unlabeled datasets 52) will typically be determined based on one or more sums of weighted, initial probability distributions associated with other datasets of the one or more supersets 61-63, respectively.

At least one sum of weighted distributions is needed for each unlabeled dataset 52 for which one wants to compute a probability distribution. I.e., this sum involves probability distributions associated with datasets of the at least one superset with which said each unlabeled dataset is associated. However, since each unlabeled dataset 52 will likely be associated with several supersets, the needed probability distributions (as eventually associated with unlabeled datasets 52) will likely be determined, each, based on several sums of weighted, initial probability distributions, as respectively obtained for those several supersets 61-63.

Note that "supersets" are sometimes referred to as "file-sets" in the present description, as the datasets typically form files. As we shall see, a superset is typically defined as the set of all files in the storage system 1 that have at least a fraction q of the metadata fields in a group of metadata fields. Thus, each dataset (e.g., a file) in a superset (e.g., a file-set) has a number of common metadata fields with all other files of the same superset.

Also, for each dataset, any information about the dataset that can be observed by the system can be considered as part of its metadata. This includes both content-dependent information and context-dependent information. Thus, the notion of supersets can be extended to context-dependent supersets.

The distances between datasets are preferably computed according to specific distance metrics, i.e., metrics that are respectively associated with the supersets 61-63 of interest. Suitable distance metrics can, for instance, be determined based on metadata fields associated with the datasets in each superset 61-63 of interest. As discussed below, this may not only include common metadata fields (i.e., metadata fields shared by all datasets of a given superset) but, also, unshared (i.e., rare) metadata fields. In all cases, distinct supersets 61-63 may have distinct distance metrics associated therewith, since distinct supersets typically involve distinct sets of metadata fields.

For example, if distance metrics are determined based on common metadata fields shared by all datasets of a given superset 61-63 of interest, then the distance between two datasets f₁, f₂ of that superset can for instance be computed as a symmetric divergence measure between two conditional probability distributions P(R|k, v_(k)(f₁)) and P(R|k, v_(k)(f₂)). Here, the two conditional probability distributions respectively pertain to the two datasets f₁, f₂. A Jensen-Shannon divergence may for instance be used. The above conditional probability distribution P(R|k, v_(k)(f)) is an empirical distribution of probability, which is obtained from the labeled datasets 51. Such a conditional probability distribution can be regarded as a probability of observing a given relevance value R when a file f has metadata values v_(k)(f), for a set of metadata fields k.

In more sophisticated embodiments, the distance metric associated to a given superset is determined as a combination of: (i) symmetric divergence measures between conditional probability distributions (as exemplified above) and (ii) distances between uncommon (or rare) metadata fields. In the latter case, the distance can for example be obtained by applying a logistic regression model to the metadata values of metadata fields for two files and then by comparing the outcomes, see Eqs. (40)-(41) in section 2. Note that considering the ranges of values for metadata fields can be leveraged to define context-dependent supersets.

In embodiments relying on supersets, as assumed in FIG. 2, the needed probability distributions (as eventually associated with unlabeled datasets 52) can be computed thanks to auxiliary probability distributions. That is, for each unlabeled dataset 52, an auxiliary probability distribution (over a set of relevance values) is computed for each superset 61-63 of interest, i.e., to which said each unlabeled dataset 52 belongs. Each auxiliary probability distribution can for instance be computed as a sum of weighted, initial probability distributions, as explained earlier. I.e., the initial probability distributions are respectively associated with all datasets (but said each unlabeled dataset) of a given superset 61-63. Again, the initial probability distributions are preferably weighted according to inverse distances between the datasets.

Eventually, the probability distribution sought for a given, unlabeled dataset 52 can be obtained as a simple function of all the auxiliary probability distributions computed for that given, unlabeled dataset 52. Once all auxiliary probability distributions have been obtained, several ways can be contemplated to obtain the final probability distribution. One may for instance multiply all the auxiliary probability distributions (as computed for each superset with which that given, unlabeled dataset is connected), to average out relevance values as obtained from the several supersets to which this dataset is connected. One may further multiply the result by the input distribution, as initially attributed to that given, unlabeled dataset 52, at initialization, and normalize the result, e.g., by dividing the result of the multiplication by a constant such that the probabilities sum to one, as explained in detail in sect. 2. In variants, only a subset of the auxiliary probability distributions may be used, selected, e.g., based on the number of metadata fields involved in the corresponding supersets.

As evoked above, the present methods typically require an initialization step, during which probability distributions as associated with the datasets 51, 52 are initialized S10. This may notably be achieved as illustrated in FIG. 3 (iteration 0). That is, a Dirac delta function may be associated with each labeled dataset 51, since the latter has a known relevance value. I.e., the resulting Dirac delta function is centered on this known relevance value. However, because the relevance value of unlabeled datasets 52 is initially unknown, by definition, one may associate them with uniform distributions, i.e., a distribution that is uniform over each potential relevance value. Still, other initialization schemes could be contemplated, which may for example reflect a known bias or prejudice in the statistical distributions over relevance values, where applicable.
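By way of a non-limiting illustration, such an initialization could be sketched as follows in Python (NumPy assumed; the number of relevance classes and the function name are hypothetical, and classes are indexed from zero for simplicity):

```python
import numpy as np

N_R = 4  # number of relevance classes (illustrative value)

def init_distribution(relevance=None, n_classes=N_R):
    """Dirac delta centered on the known relevance value for labeled
    datasets; uniform distribution for unlabeled ones (iteration 0)."""
    if relevance is None:                       # unlabeled dataset
        return np.full(n_classes, 1.0 / n_classes)
    p = np.zeros(n_classes)                     # labeled dataset
    p[relevance] = 1.0                          # delta at the known value
    return p
```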

The present methods are preferably implemented thanks to a message-passing algorithm, as further illustrated in FIG. 3. To that aim, a heterogeneous bipartite factor graph need be defined. I.e., a graph is defined, which exhibits two types of nodes. Thus, datasets and supersets are associated with a first type of nodes and a second type of nodes of the graph, respectively, as assumed in FIG. 3.

With the help of such a graph, unlabeled datasets 52 can easily be associated with supersets, i.e., by merely connecting each unlabeled dataset 52 to one or more nodes of the second type of the graph. This way, the probability distribution associated with each unlabeled dataset 52 can now be computed S24 thanks to a message-passing algorithm run on the heterogeneous bipartite graph, whereby probability distributions are passed as messages along edges of the graph that connect pairs of nodes of different types. The probability distributions are refined at each iteration.

The messages P_(R)(f) sent at each iteration of the algorithm to the connected datasets f are based on probability distributions associated with other datasets (i.e., datasets that are also connected to the same supersets 61-63) and distances between the connected datasets, as explained earlier.

For example, assume that each dataset is suitably connected to each superset (or file-set) to which it belongs, as per some suitable metrics. For example, in FIG. 3, files f₁-f₃ are, each, connected to the first file-set FS₁, while all files f₁-f₅ are, each, connected to the second file-set FS₂. Each file-set FS_(i) is associated with a distance metric obtained from the various metadata fields involved, which indicates the similarity of files with respect to their relevance values. At the beginning of the iterative message-passing algorithm (iteration 0), the distributions of the relevance value probabilities are initialized to delta functions for the labeled files and to uniform distributions for the unlabeled files, as evoked earlier. At each iteration, probability distributions are sent to the file-set nodes. Following a computation at each file-set node using the messages received and the distance metric associated with that file-set node, the results of the computation are sent back as messages to the file nodes. The algorithm can already be stopped after one iteration, providing a rough estimate of the relevance distributions for the unlabeled files, or it can be continued for a number of iterations. Eventually, modified probability distributions are obtained (more or less centered on a given value, indicative of the relevance of the files), from which relevance values can be obtained, as seen in FIG. 3 (iteration n).

The present schemes all rely on labeled datasets 51, i.e., a training set, which may be obtained thanks to inputs of a user 40 (or users), or, even, applications 30. Referring to FIG. 4, embodiments of the present methods may accordingly include (prior to determining the probability distributions) steps aiming at rating datasets 51 selected for the training. For example, as assumed in FIG. 4, user ratings may be received at step S2 by the relevance determination unit 22, based on which the needed probability distributions may subsequently be computed S20. As said earlier, only a small fraction of the datasets is typically needed to train the system, e.g., less than 1% or 0.1% of the total number of datasets in the storage system 1.

As further seen in FIG. 4, once relevance values have been obtained S20, e.g., for files as identified at steps S2 or S3 (i.e., as per exchanges with the applications/users or the storage units 10), a suitable storage management policy may be invoked by unit 24, which takes the determined relevance values as input S29. As a result, this storage management policy may notably instruct S31 to store, duplicate, garbage collect, etc., the datasets 51, 52 across storage tiers 11-13 of the storage system 1, based on their associated relevance values. This includes the initial datasets 51 (those of the training set, which have been initially rated) and, all the more, the unlabeled datasets 52, for which relevance values were automatically determined S20. Statistics as to the distribution of relevance values may further be aggregated and maintained, in order to design suitable storage management policies for the storage system 1. In addition, the management policy may be context-dependent, as evoked earlier.

Referring now to FIGS. 4, 5, another aspect of the invention is briefly described, which concerns the storage system 1 itself. As explained earlier, the system comprises a relevance determination unit 22. Consistently with principles underlying the present invention, the unit 22 is configured to determine S20, for each unlabeled dataset 52, a respective probability distribution over a set of relevance values, which, in turn, allows a corresponding relevance value to be obtained. As discussed above, the probability distributions are computed based on distances, in terms of metadata values, between the datasets 51, 52.

The system also includes a management policy unit 24, configured to manage S30 datasets 51, 52 in the storage system 1, based on the relevance values as determined by the unit 22. As said above, the management policy unit 24 may notably be designed to store datasets 51, 52 across storage tiers 11-13 of the storage system 1, according to a storage management policy that precisely depends on the relevance values determined S20 for the datasets 52.

In the example of FIG. 4, the storage units 10 comprise three tiers 11-13 of storage. More generally though, the system may comprise two tiers, or more than three tiers of storage. A tiered storage system is known per se. A tier is typically defined as a homogenous collection of storage devices of a same kind, all having similar (if not identical) storage characteristics. Typically, yet, the system will involve three tiers of storage. For instance, the units 10 depicted in FIG. 4 involve SSD devices 11 (first tier), high-end disks 12 (second tier), and tape drives 13 (third tier). Yet, additional tiers could be involved, e.g., low-end disks could be used in an intermediate tier between tiers 12 and 13.

The datasets considered here can be any consistent set of data, whose granularity may range between, e.g., data blocks (i.e., physical records, having a given, maximum length) and files (i.e., collections of blocks or file fragments), or collections of files. More generally, it may be any sequence of bytes or bits, or file fragments, having a predefined format or length.

The datasets stored across the tiers are likely to be accessed by one or more applications 30 as the latter interact S1 with the storage units 10. By interacting with the units 10, applications 30 consume data as input, which input data need be fetched from the storage units 10, and also produce new data, which may need be stored on the units 10. Thus, new datasets may constantly appear, which may need be rated S20 according to methods as discussed herein, whence the advantage of learning techniques.

Next, according to a final aspect, the invention can also be embodied as a computer program product. The latter will typically be a computer readable storage medium having program instructions embodied therewith, which instructions are executable by one or more processors, e.g., of a unit 101 such as depicted in FIG. 5, to implement functions of a unit 22, 24 such as described above.

The above embodiments have been succinctly described in reference to the accompanying drawings and may accommodate a number of variants. Several combinations of the above features may be contemplated, as exemplified below.

The programs described herein are identified based upon the application for which they are implemented in a specific embodiment of the invention. However, it should be appreciated that any particular program nomenclature herein is used merely for convenience, and thus the invention should not be limited to use solely in any specific application identified and/or implied by such nomenclature.

Example embodiments rely on graph-based, semi-supervised learning methods, which can achieve high estimation accuracy with few training samples.

Consider a system with files denoted by f_(i), i=1, . . . , N. For each file, f_(i), any information about the file that can be observed by the system 1 can be considered as part of its metadata. This includes both content-dependent information and context-dependent information. Content-dependent metadata depends directly on the content of the file and can only change if the content changes. Examples of content-dependent metadata include:

File-system metadata like path, owner, date created, etc.; and

Metadata specific to certain types of files, e.g., camera model, aperture, focal length, and so on, for photos; artist, album, genre, and so on, for music files; and primary investigator, declination, right ascension, and so on, for radio astronomy files.

In contrast to content-dependent metadata, context-dependent metadata depends on how the external world interacts with the file. Examples of context-dependent metadata include:

The users/applications accessing the file, e.g., the number of times each user has accessed the file in a given time window;

The publications that refer to the radio telescope observation data contained in a file;

Current location of the file, e.g., on-premise or cloud, SSD, HDD, or tape; and

External knowledge-bases, e.g., social-media trends related to information contained in a file.

Context-dependent metadata can change over time even if the content of the file remains the same. Although context-dependent metadata also indirectly depends on the content, it depends, more importantly, on how the external world (i.e., users, applications, and systems) perceives and interacts with this content (e.g., the same file may be perceived relevant by one user but irrelevant by another) and therefore provides additional information that can be used to assess data relevance.

The metadata of a file f can be represented by an arbitrary set of key-value pairs, i.e., {(k₁(f), v_(k1)(f)), (k₂(f), v_(k2)(f)), . . . }. The following notation will be used to represent its metadata and relevance:

K(f): the set of all metadata fields (keys) that can be observed for file f;

K(F): the set of all metadata fields (keys) that can be observed for all files in a set F;

$\begin{matrix}{{K(F)} = {\bigcup\limits_{f \in F}{K(f)}}} & (1)\end{matrix}$

v_(k): metadata value corresponding to the metadata field k;

V_(k)(f): the set of all metadata values corresponding to the metadata field k that can be observed for file f;

$V_k(f) = \{v_k^{(1)}(f), v_k^{(2)}(f), \ldots, v_k^{(|V_k(f)|)}(f)\} \qquad (2)$

If k∉K(f), then V_(k)(f)=∅. For single-valued metadata fields, the set comprises only one value per file;

V_(k)(F): the set of all metadata values corresponding to the metadata field k for all files in a set F;

$\begin{matrix}{{V_{k}(F)} = {\bigcup\limits_{f \in F}{V_{k}(f)}}} & (3)\end{matrix}$

R(f): the relevance of file f expressed as a value from a finite set of values, e.g., {1, 2, . . . , 10} or {"highly relevant", "relevant", "less relevant", "irrelevant"}. Of course, many variants are possible. As subjective as they may appear, such values nevertheless define a set of distinct values, which subsequently impact the way the files are managed;

N_(R): number of relevance classes in the system (each relevance class corresponds to one relevance value);

P_(R)(f): probability distribution function of the relevance of file f over the set of N_(R) relevance classes. This is a delta function for files 51 for which the relevance is assigned during training, as noted in sect. 1; and

F_(train): set of files used for training.

Note that the metadata fields and their values can be defined in many different ways and that the values can also be sets. This allows for sufficient flexibility in choosing the metadata that is relevant for different types of files, e.g., information about artist, album, and genre may be relevant for music but not for photos.

In terms of the above formulation, a context C is defined as a set of metadata fields, K(C), where each of these metadata fields can take values in a certain set, V_(k)(C), ∀k∈K(C). Below are a few examples of contexts:

All photos taken during a vacation. In this case, the metadata fields of interest could be the file extension, which could take values .jpg, .jpeg, etc.; the date created, which could take values that correspond to the vacation period; and geolocation, which could take values that correspond to the locations visited during the vacation.

All files belonging to a project. In this case, the metadata fields of interest could be the file path, which could take values corresponding to the folders in which the project files are located; file owners, which could take values corresponding to the people involved in the project; and document keywords, which could take values corresponding to project-related keywords.

As can be seen, a very large number of contexts can be defined depending on the metadata fields and the range of values they can take.

Although the above formulation allows flexibility, it may also present some challenges in designing a suitable learning algorithm.

Different sets of metadata fields. Defining a distance between two files requires that we have the same set of metadata fields available for both files. If two files share a subset of their metadata fields, this subset can be used to define a partial distance. The remainder of the metadata fields may need be handled separately (e.g., using logistic regression) to obtain another partial distance, which is then combined with the aforementioned distance to obtain a complete distance measure. In some cases, e.g., the distance between a music file and a photo, the subset of common metadata fields may be too restrictive and perhaps irrelevant for the purpose of determining a distance. In such cases, we need not one but multiple distance metrics to treat data of different types of files.

Selecting samples to send to the user for training. In practice, the files need be sampled and sent to the user to obtain relevance labels. However, if the samples were selected uniformly at random, it may happen that all the files sampled are of one type only (e.g., all music files or all photos) or that all files are of similar relevance.

Multiple values per metadata field. Distance metrics can be defined assuming a single value per metadata field. For instance, for a given metadata field k, the partial distance between two files having two different values v_(k)(f₁) and v_(k)(f₂) can be computed as a symmetric divergence measure (e.g., Jensen-Shannon divergence) between the conditional probability distributions P(R|(k, v_(k)(f₁))) and P(R|(k, v_(k)(f₂))), where these distributions are obtained empirically from the training samples. This may be extended to cases where there may be multiple values associated with a metadata field of a file. One possible method is to define the partial distance between two files having two sets of values {v_(k, 1)(f₁), v_(k, 2)(f₁), . . . } and {v_(k, 1)(f₂), v_(k, 2)(f₂), . . . } as a symmetric divergence measure between the conditional probability distributions P(R|(k, {v_(k, 1)(f₁), v_(k, 2)(f₁), . . . })) and P(R|(k, {v_(k, 1)(f₂), v_(k, 2)(f₂), . . . })). This may further be simplified by using conditional independence assumptions, as used in the naïve Bayes model.

File-sets. To address the first challenge, we introduce the notion of supersets, here assumed to be file-sets. Given a set of keys (metadata fields) K, a file-set, F_(q)(K), with respect to K with parameter q is defined as the set of all files in the system that have at least a fraction q of the keys in K.

$F_q(K) = \{f : |K(f) \cap K| / |K| \ge q\}, \quad 0 \le q \le 1. \qquad (4)$
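A minimal Python sketch of the membership test of Eq. (4) follows; the data structures (a mapping from file identifiers to their observed metadata fields) are illustrative assumptions, not part of the original:

```python
def file_set(files_metadata, K, q):
    """Return the file-set F_q(K): all files whose observed metadata
    fields cover at least a fraction q of the keys in K (Eq. (4)).

    files_metadata: dict mapping file id -> set of metadata fields K(f).
    """
    K = set(K)
    return {f for f, keys in files_metadata.items()
            if len(keys & K) / len(K) >= q}

# Hypothetical example: a photo and a music file.
meta = {"a.jpg": {"path", "owner", "camera", "aperture"},
        "b.mp3": {"path", "owner", "artist", "album"}}
print(file_set(meta, {"camera", "aperture", "path"}, q=0.66))  # {'a.jpg'}
```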

We use the notion of file-sets to scan all the files in the system, and the metadata corresponding to them, to identify the different types of files in the system (e.g., music files, photos, documents, etc.), assuming a fixed parameter q. Thus, we obtain a file-set for each type of file. Note, however, that a file-set could also be defined by a user, e.g., by choosing a set of metadata fields and a value for the parameter q.

The notion of file-sets can be extended to context-dependent file-sets. A context-dependent file-set F_(q)(C) with respect to context C with parameter q is defined as the set of all files that have at least a fraction q of the metadata fields in the context, K(C), and have the values of the corresponding fields within the set of values, V_(k)(C), defined by the context.

$F_q(C) = \{f : |K(f) \cap K(C)| / |K(C)| \ge q \ \text{and} \ v_k(f) \in V_k(C) \ \forall k \in K(f) \cap K(C)\}. \qquad (5)$

Now, suppose that we have N_(FS) file-sets denoted by FS_(i), i=1, . . . , N_(FS), as in FIG. 3 (where N_(FS) is equal to 2 in this simple example). These could be automatically obtained, user-defined, or even context-dependent. Regardless of how they are obtained, for values of the parameter q close to one, we have the desirable property that each file in a file-set has a significant number of common metadata fields with all other files in the same file-set. This allows us to define suitable distance metrics, D_(FSi), over the set of keys in each file-set, FS_(i). The weights associated with the different metadata fields in a distance metric can for instance be learned from training data with files from that file-set. The notion of file-sets also allows for a more effective sampling of files for training. In particular, instead of sampling uniformly from all the files in the system, we can now sample uniformly from each of the different file-sets; e.g., we could sample a few music files and a few photos, as they are likely to belong to different file-sets.

Consider a file-set with m metadata fields for which a distance metric is to be defined. Let P_(ij)(R), i=1, . . . , m, j=1, . . . , m_(i), denote the conditional probability distributions of the relevance R given that the ith metadata field, k_(i), has the value j, that is,

$P_{ij}(R) = P(R \mid v_{k_i} = j), \quad i = 1, \ldots, m, \; j = 1, \ldots, m_i. \qquad (6)$

Let d_(i)(j, j′) denote the distance between a metadata field of two files taking the values j and j′, respectively. Let the Kullback-Leibler divergence between two probability distributions P and Q be denoted by d_(KL)(P∥Q):

$d_{KL}(P \| Q) = \sum_i P(i) \log \frac{P(i)}{Q(i)} \qquad (7)$

Some possible choices for d_(i)(j, j′) are given below.

Symmetrized Kullback-Leibler Divergence.

$d_i(j, j') = d_{KL}(P_{ij} \| P_{ij'}) + d_{KL}(P_{ij'} \| P_{ij}) \qquad (8)$

λ-Divergence. For 0<λ<1,

$d_i(j, j') = \lambda \, d_{KL}(P_{ij} \| \lambda P_{ij} + (1 - \lambda) P_{ij'}) + (1 - \lambda) \, d_{KL}(P_{ij'} \| \lambda P_{ij} + (1 - \lambda) P_{ij'}). \qquad (9)$

Jensen-Shannon Divergence. The λ-divergence with λ=0.5. The Jensen-Shannon divergence is bounded above by 1 for two probability distributions (given that one uses the base-2 logarithm) and below by 0, because mutual information is non-negative:

$0 \le d_i(j, j') \le 1 \qquad (10)$

Now, the distance between two files f_(x) and f_(y), with metadata values v_(ki)(f_(x))=x_(i) and v_(ki)(f_(y))=y_(i), i=1, 2, . . . , m, can be defined as a linear combination of the distances between the metadata values of the two files, d_(i)(x_(i), y_(i)), using parameters α=(α₁, . . . , α_(m)), α_(i)≥0, as follows:

$D_{\alpha}(f_x, f_y) = \sum_{i=1}^{m} \alpha_i \, d_i(x_i, y_i) \qquad (11)$
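For illustration, a possible Python sketch of Eqs. (7)-(11) follows, assuming NumPy, base-2 logarithms, and empirically estimated conditional distributions; all names are hypothetical:

```python
import numpy as np

def kl(p, q):
    """Kullback-Leibler divergence d_KL(P || Q) of Eq. (7), base-2 logs."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    mask = p > 0                                   # 0 * log(0) treated as 0
    return float(np.sum(p[mask] * np.log2(p[mask] / q[mask])))

def js(p, q):
    """Jensen-Shannon divergence: the lambda-divergence of Eq. (9) with
    lambda = 0.5, bounded between 0 and 1 per Eq. (10)."""
    m = 0.5 * (np.asarray(p, float) + np.asarray(q, float))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def distance(f_x, f_y, cond_dists, alpha):
    """D_alpha(f_x, f_y) of Eq. (11): a weighted sum of per-field
    divergences between the empirical conditional distributions
    P(R | k_i = x_i) and P(R | k_i = y_i).

    f_x, f_y:   lists of metadata values, one per common field.
    cond_dists: cond_dists[i][v] = empirical P(R | field i has value v).
    """
    return sum(a * js(cond_dists[i][x], cond_dists[i][y])
               for i, (a, x, y) in enumerate(zip(alpha, f_x, f_y)))
```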

If this distance is to be a good estimator of the difference in relevance, then D_(α)(f_(x), f_(y)) must typically be a monotonically increasing function of the absolute difference of the files' relevance values, |R(f_(x))−R(f_(y))|. For example,

$D_{\alpha}(f_x, f_y) \propto |R(f_x) - R(f_y)|, \qquad (12)$

or

$D_{\alpha}(f_x, f_y) \propto e^{|R(f_x) - R(f_y)|}. \qquad (13)$

Given a set of files for training, F_(train), whose metadata values as well as relevance are known, the parameters α can be obtained by minimizing the mean squared difference between D_(α)(f_(x), f_(y)) and γ|R(f_(x))−R(f_(y))| (or γe^(|R(f_(x))−R(f_(y))|)):

$\arg\min_{\alpha, \gamma} \sum_{f_x \in F_{train}} \sum_{f_y \in F_{train}} \left( D_{\alpha}(f_x, f_y) - \gamma \, |R(f_x) - R(f_y)| \right)^2, \quad \text{s.t. } \alpha_i, \gamma > 0 \; \forall i, \qquad (14)$

or

$\arg\min_{\alpha, \gamma} \sum_{f_x \in F_{train}} \sum_{f_y \in F_{train}} \left( D_{\alpha}(f_x, f_y) - \gamma \, e^{|R(f_x) - R(f_y)|} \right)^2, \quad \text{s.t. } \alpha_i, \gamma > 0 \; \forall i. \qquad (15)$

The above are linear programming problems, solving which yields the required values of α and hence the distance metric between files, D_(α).

The exponential definition of relevance is preferably chosen. This is because, for files from the same class, the difference |R(f_(x))−R(f_(y))| is equal to 0. Moreover, the divergence measure for identical distributions is also zero. Therefore, taking into account the constraint α_(i), γ>0, the linear programming problem is flattened and has multiple local minima, which was also observed in practice. Therefore, an alternative is to define a similarity metric using the Jensen-Shannon divergence for d_(i)(j, j′):

$S_{\alpha}(f_x, f_y) = \sum_{i=1}^{m} \alpha_i \left( 1 - d_i(x_i, y_i) \right). \qquad (16)$

The similarity between two files increases with the number of common metadata values and equals 0 in the case of total dissimilarity. The γ parameter can also be placed differently:

$\gamma \, e^{|R(f_x) - R(f_y)|} \qquad (17)$

$e^{\gamma |R(f_x) - R(f_y)|} \qquad (18)$

$e^{|R(f_x) - R(f_y)|} \qquad (19)$

As a result, the linear programming problem has one global minimum. Experiments on datasets show that the best accuracy is achieved by setting the parameter γ equal to one, as in Eq. (19).

Metric learning. Solving the linear programming problem defined above is crucial for obtaining the required values of α. The α parameters define the influence of every metadata field on the distance between files, and thus on determining the relevance.

There exist several methods in the literature for solving linear programming problems with bounded constraints. One of them is the trust region reflective method, which is robust for many unbounded and bounded constraint problems and is chosen as a default algorithm for minimization problems in many frameworks. However, there are scalability and performance issues with this approach. Therefore, another well-known algorithm for parameter estimation, called Limited-memory BFGS or L-BFGS, is preferably used. It is an optimization algorithm of the family of quasi-Newton methods for solving unbounded optimization problems of the form min_(ω∈R^d) f(ω). The method approximates the objective function locally as a quadratic one and does not require calculating the second partial derivatives to construct the Hessian matrix, which is instead approximated from previous gradient evaluations. As a result, L-BFGS achieves rapid convergence in comparison with the trust region reflective method and is scalable. For the results presented here, the L-BFGS-B algorithm, a variation of the L-BFGS algorithm for bound-constrained optimization, is used. The L-BFGS algorithm is used for training the logistic regression models considered in the results.
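As a sketch only, the bounded minimization of Eq. (15) could be set up with SciPy's L-BFGS-B optimizer as follows; the data layout and function names are assumptions for illustration, not part of the original:

```python
import numpy as np
from scipy.optimize import minimize

def fit_alpha(D_pairs, R_diff_exp, m):
    """Fit the weights alpha (and gamma) of Eq. (15) by minimizing the
    squared difference between D_alpha(f_x, f_y) and
    gamma * exp(|R(f_x) - R(f_y)|), using the bounded L-BFGS-B optimizer.

    D_pairs:    array (n_pairs, m), per-field distances d_i(x_i, y_i).
    R_diff_exp: array (n_pairs,), targets exp(|R(f_x) - R(f_y)|).
    """
    def objective(theta):
        alpha, gamma = theta[:m], theta[m]
        residual = D_pairs @ alpha - gamma * R_diff_exp
        return np.sum(residual ** 2)

    theta0 = np.ones(m + 1)
    bounds = [(1e-9, None)] * (m + 1)   # approximates alpha_i, gamma > 0
    res = minimize(objective, theta0, method="L-BFGS-B", bounds=bounds)
    return res.x[:m], res.x[m]
```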

Closest neighbor rule (CNR). The relevance of a new file f_(z) can then be estimated using the learned distance metric or similarity metric as follows:

$\begin{matrix}{{{R\left( f_{z} \right)} = {R\left( f_{x}^{*} \right)}},{{{where}\mspace{14mu} f_{x}^{*}} = {\underset{f_{x} \in F_{train}}{\arg \; \min}\mspace{11mu} {{D_{\alpha}\left( {f_{x},f_{z}} \right)}.}}}} & (20) \\{{{R\left( f_{z} \right)} = {R\left( f_{x}^{*} \right)}},{{{where}\mspace{14mu} f_{x}^{*}} = {\underset{f_{x} \in F_{train}}{\arg \; \max}\mspace{11mu} {{S_{\alpha}\left( {f_{x},f_{z}} \right)}.}}}} & (21)\end{matrix}$

The closest neighbor rule iterates over the whole training set to estimate the relevance of each file, which may not be the most efficient approach. Moreover, this makes it rather difficult to parallelize the computation, because we need to broadcast the whole training set to all nodes in the computing cluster.

Clusteroids (CL-ζ). As a solution, an alternative approach using the notion of clusteroids may preferably be used. For each relevance class r in the system, r=1, . . . , N_(R), its clusteroid h(r) is defined as the file belonging to the relevance class with the least sum of distances (or highest sum of similarity) to all other files in that relevance class. Let F(r)⊂F_(train) denote the set of files in relevance class r. Then,

$\begin{matrix}{{{{h(r)} = {\underset{f_{x} \in {F{(r)}}}{\arg \; \min}{\sum\limits_{f_{y} \in {F{(r)}}}{D_{\alpha}\left( {f_{x},f_{y}} \right)}}}},{r = 1},\; {.\;.\;.}\mspace{14mu},N_{R}}{or}} & (22) \\{{{h(r)} = {\underset{f_{x} \in {F{(r)}}}{\arg \; \max}{\sum\limits_{f_{y} \in {F{(r)}}}{S_{\alpha}\left( {f_{x},f_{y}} \right)}}}},{r = 1},\; {.\;.\;.}\mspace{14mu},N_{R}} & (23)\end{matrix}$

Now, instead of iterating through the entire training set as in the CNR approach, we may instead iterate over the clusteroids of each relevance class.
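For illustration, Eq. (22) might be sketched as follows (dist is again a stand-in for the learned metric):

```python
def clusteroid(files_in_class, dist):
    """Clusteroid h(r) of a relevance class, Eq. (22): the file with the
    least sum of distances to all other files in that class."""
    def total_distance(i):
        return sum(dist(files_in_class[i], files_in_class[j])
                   for j in range(len(files_in_class)) if j != i)
    best = min(range(len(files_in_class)), key=total_distance)
    return files_in_class[best]
```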

The notion of clusteroids can also be extended to clusteroid sets, e.g., by selecting a certain number of files from each relevance class with the smallest sums of distances (or highest sums of similarity) to all other files in the same relevance class. The number of files selected from each relevance class to form the clusteroid set is a parameter ζ that can be configured. A clusteroid set for relevance class r, denoted by H(r), is given by

$H(r) = \{h_1(r), h_2(r), \ldots, h_{|H(r)|}(r)\}, \qquad (24)$

where h_(i)(r) are obtained as follows:

$\begin{matrix}{{{{h_{i}(r)} = {\underset{f_{x} \in {F_{i}{(r)}}}{\arg \; \min}{\sum\limits_{f_{y} \in {F_{i}{(r)}}}{D_{\alpha}\left( {f_{x},f_{y}} \right)}}}},{r = 1},\; {.\;.\;.}\mspace{14mu},N_{R},{i = 1},\; {.\;.\;.}\mspace{14mu},{{H(r)}}}{or}} & (25) \\{{{h_{i}(r)} = {\underset{f_{x} \in {F_{i}{(r)}}}{\arg \; \max}{\sum\limits_{f_{y} \in {F_{i}{(r)}}}{S_{\alpha}\left( {f_{x},f_{y}} \right)}}}},{r = 1},\; {.\;.\;.}\mspace{14mu},N_{R},{i = 1},\; {.\;.\;.}\mspace{14mu},{{H(r)}}} & (26)\end{matrix}$

and F_(i)(r) are defined as follows:

$F_1(r) = F(r) \qquad (27)$

$F_i(r) = F(r) \setminus \{h_1(r), \ldots, h_{i-1}(r)\}, \quad i = 2, \ldots, |H(r)| \qquad (28)$

Note that the set of clusteroids may also be obtained using other criteria. Let H denote the union of all the clusteroid sets, that is,

$H = \bigcup_{r=1}^{N_R} H(r) \qquad (29)$

Then, the relevance of a new file f_(z) can be estimated as:

$\begin{matrix}{{{{R\left( f_{z} \right)} = r^{*}},{{{where}\mspace{14mu} r^{*}} = {\underset{r}{argmin}{\sum\limits_{{h{(r)}} \in {H{(r)}}}\frac{D_{\alpha}\left( {{h(r)},f_{z}} \right)}{{H(r)}}}}}}{or}} & (30) \\{{{R\left( f_{z} \right)} = r^{*}},{{{where}\mspace{14mu} r^{*}} = {\underset{r}{argmax}{\sum\limits_{{h{(r)}} \in {H{(r)}}}{\frac{S_{\alpha}\left( {{h(r)},f_{z}} \right)}{{H(r)}}.}}}}} & (31)\end{matrix}$

Unseen and missing metadata values. There may be cases when the ith metadata field, k_(i), of a file f_(z) takes a value j that has not been observed in the training set. This implies that the corresponding empirical conditional probability distribution of the relevance metric, namely, P(R|v_(ki)=j), is not available. Similarly, there may be cases where the value corresponding to a particular metadata field is not available. In such cases, the conditional probability can be assumed to be uniform over the relevance values. However, using a uniform distribution may not perform well in all cases, especially for files which do not have values for many metadata fields. Moreover, the presence of multiple metadata values for a single metadata field will also need to be addressed.

To address these issues, in embodiments, the set of metadata fields in the training set, K(F_(train)), can be split into two sets:

A set, K_(q)(F_(train)), which consists of keys that are present in at least a fraction q of all files and can have only a single value; and

A set, K_(z)(F_(train))=K(F_(train))\K_(q)(F_(train)), which consists of keys that are not present in at least a fraction q of all files or that can have more than one value.

The second set, K_(z)(F_(train))={k₁, k₂, . . . , k_(|K_z(F_(train))|)}, is typically small for high values of q because it consists of rare keys. Based on this set we construct a vector M of key-value pairs as follows:

$M = \left( (k_1, V_{k_1}(F_{train})), (k_2, V_{k_2}(F_{train})), \ldots, (k_{|K_z(F_{train})|}, V_{k_{|K_z(F_{train})|}}(F_{train})) \right), \qquad (32)$

where V_(ki)(F_(train)) denotes the set of all values observed in the training set for metadata field k_(i). Here, (k_(i), V_(ki)(F_(train))) is used as shorthand for the sequence of pairs

$(k_i, v_{k_i}^{(1)}), (k_i, v_{k_i}^{(2)}), \ldots, (k_i, v_{k_i}^{(|V_{k_i}(F_{train})|)}).$

The length of this vector, denoted by L, is equal to the sum of the number of values seen for each metadata field in K_(z)(F_(train)):

$L = \sum_{k \in K_z(F_{train})} |V_k(F_{train})| \qquad (33)$

The vector M can now be used to define a feature space X of dimension L, where each file f is represented by a row vector x(f)∈X as follows:

$x(f) = (x_1(f), x_2(f), \ldots, x_L(f)), \quad \text{where} \qquad (34)$

$x_i(f) = \begin{cases} 1, & \text{if the } i\text{th key-value pair in } M \text{ is present in file } f \\ 0, & \text{otherwise} \end{cases} \qquad (35)$
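A minimal sketch of Eqs. (34)-(35), with hypothetical key-value data:

```python
def feature_vector(file_kv_pairs, M):
    """Binary feature vector x(f) of Eqs. (34)-(35): component i is 1
    iff the ith key-value pair of the vector M is present in file f.

    file_kv_pairs: set of (key, value) pairs observed for the file.
    M:             ordered list of (key, value) pairs over K_z(F_train).
    """
    return [1 if kv in file_kv_pairs else 0 for kv in M]

# Hypothetical example with two rare keys:
M = [("artist", "X"), ("artist", "Y"), ("genre", "rock")]
print(feature_vector({("artist", "Y"), ("genre", "rock")}, M))  # [0, 1, 1]
```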

The features as defined above may be used as input for a learning algorithm that focuses on the key set K_(z)(F_(train)). Multinomial logistic regression, the generalized version of binary logistic regression that is widely used for multiclass classification problems, is preferably used as the learning algorithm. As we have N_(R) relevance classes, one of the outcomes can be chosen as a pivot, and the other N_(R)−1 classes can be separately regressed against the pivot outcome. The most relevant class is preferably chosen as the pivot class. As a result, the algorithm will output a multinomial logistic regression model, which contains N_(R)−1 binary logistic regression models regressed against the first class, which is chosen as the pivot, with loss functions Loss_(i) defined as follows:

$\text{Loss}_i(w_i, x(f), R(f)) = \log\left( 1 + \exp(-R(f) \, x(f) \, w_i^T) \right), \quad i = 1, \ldots, N_R - 1 \qquad (36)$

Here, X is a |F_(train)|×L feature matrix constructed as follows:

$X = \begin{bmatrix} x(f_1) \\ x(f_2) \\ \vdots \\ x(f_{|F_{train}|}) \end{bmatrix}, \quad f_i \in F_{train} \; \forall i \qquad (37)$

And w_(i) is a row vector of length L containing weights, which are real numbers that correspond to the ith logistic regression model.

Given a new file f, its corresponding feature vector x(f) can be constructed based on M as described above. The estimation of the relevance class is made by applying the logistic functions, g_(i)(f), which are defined as follows:

$\begin{matrix}{{g_{1}(f)} = 0.5} & (38) \\{{{g_{i}(f)} = \frac{1}{1 + {\exp \left( {{- w_{i}}{x^{T}(f)}} \right)}}},{i = 2},\; {.\;.\;.}\mspace{14mu},N_{R}} & (39)\end{matrix}$

The class with the largest value of the logistic function is chosen as the estimated class.
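For illustration, the scoring of Eqs. (38)-(39) and the subsequent class selection could be sketched as follows (NumPy assumed; W is a hypothetical array holding the learned weight rows w_i for the non-pivot classes i = 2, . . . , N_R):

```python
import numpy as np

def estimate_class(x_f, W):
    """Pick the relevance class with the largest logistic score, per
    Eqs. (38)-(39). The pivot class 1 is fixed at g_1(f) = 0.5."""
    scores = [0.5]                                    # pivot class 1
    for w_i in W:                                     # classes 2..N_R
        scores.append(1.0 / (1.0 + np.exp(-w_i @ np.asarray(x_f))))
    return int(np.argmax(scores)) + 1                 # classes numbered from 1
```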

Given two files f_(x) and f_(y), and their estimated classes R(f_(x)) and R(f_(y)) based on applying multinomial logistic regression on the values of metadata fields K_(z)(F_(train)), a vector ξ(f_(x), f_(y))=(ξ₁, . . . , ξ_(NR)) of length N_(R) is constructed as follows:

$\xi_i = \begin{cases} 1, & \text{if } R(f_x) = R(f_y) = i \\ 0, & \text{otherwise} \end{cases}, \quad \text{for } i = 1, \ldots, N_R \qquad (40)$

To obtain a combined distance metric based on the values of both sets of metadata fields, K_(q)(F_(train)) and K_(z)(F_(train)), we introduce parameters β_(i), i=1, . . . , N_(R), and form a linear combination as follows:

$D_{\alpha,\beta}(f_x, f_y) = \sum_{i=1}^{|K_q(F_{train})|} \alpha_i \, d_i(x_i, y_i) + \sum_{i=1}^{N_R} \beta_i \, \xi_i. \qquad (41)$

In a similar manner, a combined similarity metric can be obtained. Logistic regression models have a direct probabilistic interpretation for the raw model outputs g_(i)(f). This is the main reason why logistic regression is here preferred over other models such as support vector machines.

Additionally, some other aspects can be considered for improving the generalization of the model, the convergence time, and the accuracy. A relatively important aspect is the regularization type, which is used to avoid overfitting. The following two regularizers were tested for the logistic regression model:

$L1 = \|W\|_1 \qquad (42)$

$L2 = \tfrac{1}{2} \|W\|_2^2 \qquad (43)$

Moreover, the model is preferably run with activated bias features and the L2 regularizer of Eq. (43), for which the results are presented.

Updating the feature space for logistic regression. Logistic regression requires retraining on the updated feature space when new metadata fields and values are observed as more files are added to the training set. To solve this issue, the concepts of streaming logistic regression and extended feature space can be used. In streaming logistic regression, fitting occurs on each batch of data, and the model continually updates to reflect the data from the stream. The extended feature space is inspired by dynamic array allocation in programming languages. We create the feature space with a capacity bigger than we need during the training. For example, if the vector M has length 20 based on a given training set F_(train), a capacity to accommodate up to, for example, 40 features is chosen, and the feature vectors x(f) are padded with zeros to achieve this extended length. An important observation here is that the padded zero values will not influence model accuracy. When more files are added to the training set, the vector M is updated with newly found key-value pairs, if any.
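As an illustrative sketch (NumPy assumed; names hypothetical), zero-padding feature vectors to a fixed capacity could look like:

```python
import numpy as np

def pad_features(x_f, capacity):
    """Extended feature space: pad x(f) with zeros up to a fixed
    capacity so the streaming model's weight vector need not be resized
    when new key-value pairs are added to M. The padded zeros do not
    affect the model output, since they contribute nothing to the dot
    product w_i x(f)^T."""
    x_f = np.asarray(x_f, float)
    return np.concatenate([x_f, np.zeros(capacity - len(x_f))])

print(pad_features([1, 0, 1], capacity=6))  # [1. 0. 1. 0. 0. 0.]
```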

If the capacity of the feature space is exceeded, then the following steps can be taken:

Create a new, bigger feature space. In practice, we can extend the current feature space by padding the existing feature vectors x(f) with additional zeros;

Create a new model and initialize it with a weight vector whose length equals that of the feature space from the previous step; and

Train the model on a few previous batches.
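These steps may be sketched as follows (a minimal sketch, reusing the hypothetical pad_features helper from above and assuming a small buffer of recent (features, labels) batches is retained; not a definitive implementation):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def grow_feature_space(recent_batches, old_capacity, growth_factor=2):
    """Rebuild the model once the extended capacity is exceeded: create a
    bigger feature space, re-pad the stored vectors, and retrain the new
    model on a few previous batches."""
    new_capacity = old_capacity * growth_factor
    X = np.array([pad_features(x, new_capacity)
                  for batch_x, _ in recent_batches for x in batch_x])
    y = np.concatenate([batch_y for _, batch_y in recent_batches])
    model = LogisticRegression(penalty="l2", fit_intercept=True)
    model.fit(X, y)
    return model, new_capacity
```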

For every file-set, we have suitable distance metrics and a corresponding streaming logistic regression model. Unseen metadata values occur much more rarely within a given file-set. The concept of an extended feature space allows the model to be kept up-to-date over long periods of time.

Experiments were performed on an Apache Spark™ cluster consisting of four worker nodes, each running two executors. Each executor has 3 GB of RAM and 2 CPU cores; the driver node has 6 GB of RAM and 2 CPU cores. Two datasets were considered.

For a given set of files, a distance (or similarity) learning method was used, which divides the set of metadata fields into a set of common single-valued fields and a set of uncommon multi-valued fields, and defines a parameterized distance metric for each of these sets. The two metrics are combined and the parameters are learned using training data, according to principles discussed in sect. 2.2. When used on sets of files that share a large number of metadata fields, this method significantly reduces the number of parameters required to learn distances, while maintaining or surpassing the accuracy of other methods, such as logistic regression, that employ a far greater number of parameters. Thus, the proposed learning method allows faster learning and is robust against overfitting. Moreover, the proposed method can be applied to different supersets, which represent various contexts, to learn a distance metric for each context. This paves the way for an efficient graph construction, allowing graph-based algorithms to estimate data relevance across multiple contexts.

Consider a heterogeneous bipartite graph as shown in FIG. 3, with two types of vertices, used to represent the file-sets and the files, respectively. Each file is connected to all the file-sets to which it belongs. Each file-set FS_(j) is associated with its distance metric D_(FSj). Each file f_(i) is associated with its metadata fields, K(f_(i)), the corresponding values, V(f_(i)), and its probability distribution over the set of relevance values, P_(R)(f_(i)).

Learning of relevance is based on a message-passing algorithm, where probability distributions over relevance values are passed as messages along the edges of the graph. At the beginning, the P_(R)(f_(i)) are initialized to delta functions for the labeled files and to uniform distributions for the unlabeled files, as shown in FIG. 3. In each iteration, these probability distributions are sent to the file-set nodes. Following a computation at each file-set node using the messages received and the distance metric associated with that file-set node, the results of the computation are sent back to the file nodes. The algorithm can be stopped after one iteration, providing an estimate of the relevance distributions for the unlabeled files, or can be continued for a fixed number of iterations.

A preferred way to compute the message P_(R)(f) to be sent to each file f connected to a file-set FS is as follows:

$\begin{matrix}{{P_{R}(f)} = {\frac{\sum_{f^{\prime} \in {{FS}\backslash {\{ f\}}}}{{P_{R}\left( f^{\prime} \right)}/{D_{FS}\left( {f,f^{\prime}} \right)}}}{\sum_{f^{\prime\prime} \in {{FS}\backslash {\{ f\}}}}{1/{D_{FS}\left( {f,f^{\prime\prime}} \right)}}}.}} & (44)\end{matrix}$

In other words, the above distribution is a weighted average of the distributions received at the file-set node FS (and pertaining to the other datasets of said file-set FS), with weights equal to the inverse of the distance between files.

To aggregate all the messages received at the file nodes, we simply multiply all the probability distributions sent from the nodes corresponding to all the file-sets to which the file belongs, together with the input distribution, and normalize the result.
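One iteration of this message passing may be sketched as follows (a minimal sketch, assuming P_R maps files to distributions stored as arrays and D_FS is the learned distance metric of the file-set; all names are hypothetical):

```python
import numpy as np

def file_set_message(f, file_set, P_R, D_FS):
    """Message of Eq. (44) from a file-set node to file f: a weighted average
    of the other members' distributions, weighted by inverse distance."""
    others = [g for g in file_set if g != f]
    weights = np.array([1.0 / D_FS(f, g) for g in others])
    incoming = np.array([P_R[g] for g in others])  # one distribution per row
    return (weights[:, None] * incoming).sum(axis=0) / weights.sum()

def aggregate_at_file(messages, P_in):
    """Aggregate at a file node: multiply all incoming distributions with the
    input distribution, then normalize the result."""
    out = P_in.copy()
    for msg in messages:
        out *= msg
    return out / out.sum()
```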

Computerized devices can be suitably designed for implementing embodiments of the present invention as described herein. In that respect, it can be appreciated that the methods described herein are largely non-interactive and automated. In example embodiments, the methods described herein can be implemented either in an interactive, partly-interactive, or non-interactive system. The methods described herein can be implemented in software (e.g., firmware), hardware, or a combination thereof. In example embodiments, the methods described herein are implemented in software, as an executable program, the latter executed by suitable digital processing devices. More generally, embodiments of the present invention can be implemented wherein general-purpose digital computers, such as personal computers, workstations, etc., are used.

For instance, the system 1 and/or the units 10 depicted in FIG. 4 may, each, involve one or more computerized units 101, such as schematically depicted in FIG. 5, e.g., general-purpose computers. In example embodiments, in terms of hardware architecture, as shown in FIG. 5, the unit 101 includes a processor 105, memory 110 coupled to a memory controller 115, and one or more input and/or output (I/O) devices 145, 150, 155 (or peripherals) that are communicatively coupled via a local input/output controller 135. The input/output controller 135 can be, but is not limited to, one or more buses or other wired or wireless connections, as is known in the art. The input/output controller 135 may have additional elements, which are omitted for simplicity, such as controllers, buffers (caches), drivers, repeaters, and receivers, to enable communications. Further, the local interface may include address, control, and/or data connections to enable appropriate communications among the aforementioned components.

The processor 105 is a hardware device for executing software, particularly that stored in memory 110. The processor 105 can be any custom made or commercially available processor, a central processing unit (CPU), an auxiliary processor among several processors associated with the computer 101, a semiconductor based microprocessor (in the form of a microchip or chip set), or generally any device for executing software instructions.

The memory 110 can include any one or a combination of volatile memory elements (e.g., random access memory) and nonvolatile memory elements (e.g., read-only memory). Moreover, the memory 110 may incorporate electronic, magnetic, optical, and/or other types of storage media. Note that the memory 110 can have a distributed architecture, where various components are situated remote from one another, but can be accessed by the processor 105.

The software in memory 110 may include one or more separate programs, each of which comprises an ordered listing of executable instructions for implementing logical functions. In the example of FIG. 5, the software in the memory 110 includes methods described herein in accordance with example embodiments and a suitable operating system (OS). The OS essentially controls the execution of other computer programs and provides scheduling, input-output control, file and data management, memory management, and communication control and related services.

The methods described herein may be in the form of a source program, executable program (object code), script, or any other entity comprising a set of instructions to be performed. When in source program form, the program needs to be translated via a compiler, assembler, interpreter, or the like, as known per se, which may or may not be included within the memory 110, so as to operate properly in connection with the OS. Furthermore, the methods can be written in an object-oriented programming language, which has classes of data and methods, or a procedural programming language, which has routines, subroutines, and/or functions.

Possibly, a conventional keyboard 150 and mouse 155 can be coupled to the input/output controller 135. Other I/O devices 145-155 may include other hardware devices. In addition, the I/O devices 145-155 may further include devices that communicate both inputs and outputs. The system 10 can further include a display controller 125 coupled to a display 130. In example embodiments, the system 10 can further include a network interface or transceiver 160 for coupling to a network (not shown), to thereby interact with other, similar units 101, making up a system such as depicted in FIG. 4.

The network transmits and receives data between the unit 101 and external systems. The network is possibly implemented in a wireless fashion, e.g., using wireless protocols and technologies, such as WiFi, WiMax, etc. The network may be a fixed wireless network, a wireless local area network (LAN), a wireless wide area network (WAN), a personal area network (PAN), a virtual private network (VPN), an intranet, or another suitable network system, and includes equipment for receiving and transmitting signals.

The network can also be an IP-based network for communication between the unit 101 and any external server, client, and the like via a broadband connection. In example embodiments, the network can be a managed IP network administered by a service provider. Besides, the network can be a packet-switched network such as a LAN, WAN, Internet network, etc.

If the unit 101 is a PC, workstation, intelligent device, or the like, the software in the memory 110 may further include a basic input output system (BIOS). The BIOS is stored in ROM so that it can be executed when the computer 101 is activated.

When the unit 101 is in operation, the processor 105 is configured to execute software stored within the memory 110, to communicate data to and from the memory 110, and to generally control operations of the computer 101 pursuant to the software. The methods described herein and the OS, in whole or in part, are read by the processor 105, typically buffered within the processor 105, and then executed. When the methods described herein are implemented in software, the methods can be stored on any computer readable medium, such as storage 120, for use by or in connection with any computer related system or method.

The present invention may be a system, a method, and/or a computer program product at any possible technical detail level of integration. The computer program product may include a computer readable storage medium (or media) having computer readable program instructions thereon for causing a processor to carry out aspects of the present invention.

The computer readable storage medium can be a tangible device that can retain and store instructions for use by an instruction execution device. The computer readable storage medium may be, for example, but is not limited to, an electronic storage device, a magnetic storage device, an optical storage device, an electromagnetic storage device, a semiconductor storage device, or any suitable combination of the foregoing. A non-exhaustive list of more specific examples of the computer readable storage medium includes the following: a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or Flash memory), a static random access memory (SRAM), a portable compact disc read-only memory (CD-ROM), a digital versatile disk (DVD), a memory stick, a floppy disk, a mechanically encoded device such as punch-cards or raised structures in a groove having instructions recorded thereon, and any suitable combination of the foregoing. A computer readable storage medium, as used herein, is not to be construed as being transitory signals per se, such as radio waves or other freely propagating electromagnetic waves, electromagnetic waves propagating through a waveguide or other transmission media (e.g., light pulses passing through a fiber-optic cable), or electrical signals transmitted through a wire.

Computer readable program instructions described herein can be downloaded to respective computing/processing devices from a computer readable storage medium or to an external computer or external storage device via a network, for example, the Internet, a local area network, a wide area network, and/or a wireless network. The network may comprise copper transmission cables, optical transmission fibers, wireless transmission, routers, firewalls, switches, gateway computers, and/or edge servers. A network adapter card or network interface in each computing/processing device receives computer readable program instructions from the network and forwards the computer readable program instructions for storage in a computer readable storage medium within the respective computing/processing device.

Computer readable program instructions for carrying out operations of the present invention may be assembler instructions, instruction-set-architecture (ISA) instructions, machine instructions, machine dependent instructions, microcode, firmware instructions, state-setting data, configuration data for integrated circuitry, or either source code or object code written in any combination of one or more programming languages, including an object oriented programming language such as Smalltalk, C++, or the like, and procedural programming languages, such as the “C” programming language or similar programming languages. The computer readable program instructions may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server. In the latter scenario, the remote computer may be connected to the user's computer through any type of network, including a local area network (LAN) or a wide area network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet Service Provider). In some embodiments, electronic circuitry including, for example, programmable logic circuitry, field-programmable gate arrays (FPGA), or programmable logic arrays (PLA) may execute the computer readable program instructions by utilizing state information of the computer readable program instructions to personalize the electronic circuitry, in order to perform aspects of the present invention.

Aspects of the present invention are described herein with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the invention. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer readable program instructions.

These computer readable program instructions may be provided to a processor of a general-purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks. These computer readable program instructions may also be stored in a computer readable storage medium that can direct a computer, a programmable data processing apparatus, and/or other devices to function in a particular manner, such that the computer readable storage medium having instructions stored therein comprises an article of manufacture including instructions which implement aspects of the function/act specified in the flowchart and/or block diagram block or blocks.

The computer readable program instructions may also be loaded onto a computer, other programmable data processing apparatus, or other device to cause a series of operational steps to be performed on the computer, other programmable apparatus, or other device to produce a computer implemented process, such that the instructions which execute on the computer, other programmable apparatus, or other device implement the functions/acts specified in the flowchart and/or block diagram block or blocks.

The flowchart and block diagrams in the Figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods, and computer program products according to various embodiments of the present invention. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of instructions, which comprises one or more executable instructions for implementing the specified logical function(s). In some alternative implementations, the functions noted in the blocks may occur out of the order noted in the Figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems that perform the specified functions or acts or carry out combinations of special purpose hardware and computer instructions.

It is to be understood that although this disclosure includes a detailed description of cloud computing, implementation of the teachings recited herein is not limited to a cloud computing environment. Rather, embodiments of the present invention are capable of being implemented in conjunction with any other type of computing environment now known or later developed.

Cloud computing is a model of service delivery for enabling convenient, on-demand network access to a shared pool of configurable computing resources (e.g., networks, network bandwidth, servers, processing, memory, storage, applications, virtual machines, and services) that can be rapidly provisioned and released with minimal management effort or interaction with a provider of the service. This cloud model may include at least five characteristics, at least three service models, and at least four deployment models.

Characteristics are as follows:

On-demand self-service: a cloud consumer can unilaterally provision computing capabilities, such as server time and network storage, as needed, automatically, without requiring human interaction with the service's provider.

Broad network access: capabilities are available over a network and accessed through standard mechanisms that promote use by heterogeneous thin or thick client platforms (e.g., mobile phones, laptops, and PDAs).

Resource pooling: the provider's computing resources are pooled to serve multiple consumers using a multi-tenant model, with different physical and virtual resources dynamically assigned and reassigned according to demand. There is a sense of location independence in that the consumer generally has no control or knowledge over the exact location of the provided resources but may be able to specify location at a higher level of abstraction (e.g., country, state, or datacenter).

Rapid elasticity: capabilities can be rapidly and elastically provisioned, in some cases automatically, to quickly scale out, and rapidly released to quickly scale in. To the consumer, the capabilities available for provisioning often appear to be unlimited and can be purchased in any quantity at any time.

Measured service: cloud systems automatically control and optimize resource use by leveraging a metering capability at some level of abstraction appropriate to the type of service (e.g., storage, processing, bandwidth, and active user accounts). Resource usage can be monitored, controlled, and reported, providing transparency for both the provider and the consumer of the utilized service.

Service Models are as follows:

Software as a Service (SaaS): the capability provided to the consumer is to use the provider's applications running on a cloud infrastructure. The applications are accessible from various client devices through a thin client interface such as a web browser (e.g., web-based e-mail). The consumer does not manage or control the underlying cloud infrastructure, including network, servers, operating systems, storage, or even individual application capabilities, with the possible exception of limited user-specific application configuration settings.

Platform as a Service (PaaS): the capability provided to the consumer is to deploy onto the cloud infrastructure consumer-created or acquired applications created using programming languages and tools supported by the provider. The consumer does not manage or control the underlying cloud infrastructure, including networks, servers, operating systems, or storage, but has control over the deployed applications and possibly application hosting environment configurations.

Infrastructure as a Service (IaaS): the capability provided to the consumer is to provision processing, storage, networks, and other fundamental computing resources where the consumer is able to deploy and run arbitrary software, which can include operating systems and applications. The consumer does not manage or control the underlying cloud infrastructure but has control over operating systems, storage, deployed applications, and possibly limited control of select networking components (e.g., host firewalls).

Deployment Models are as follows:

Private cloud: the cloud infrastructure is operated solely for an organization. It may be managed by the organization or a third party and may exist on-premises or off-premises.

Community cloud: the cloud infrastructure is shared by several organizations and supports a specific community that has shared concerns (e.g., mission, security requirements, policy, and compliance considerations). It may be managed by the organizations or a third party and may exist on-premises or off-premises.

Public cloud: the cloud infrastructure is made available to the general public or a large industry group and is owned by an organization selling cloud services.

Hybrid cloud: the cloud infrastructure is a composition of two or more clouds (private, community, or public) that remain unique entities but are bound together by standardized or proprietary technology that enables data and application portability (e.g., cloud bursting for load-balancing between clouds).

A cloud computing environment is service oriented, with a focus on statelessness, low coupling, modularity, and semantic interoperability. At the heart of cloud computing is an infrastructure that includes a network of interconnected nodes.

Referring now to FIG. 6, illustrative cloud computing environment 50 is depicted. As shown, cloud computing environment 50 includes one or more cloud computing nodes 10 with which local computing devices used by cloud consumers, such as, for example, personal digital assistant (PDA) or cellular telephone 54A, desktop computer 54B, laptop computer 54C, and/or automobile computer system 54N, may communicate. Nodes 10 may communicate with one another. They may be grouped (not shown) physically or virtually, in one or more networks, such as Private, Community, Public, or Hybrid clouds as described hereinabove, or a combination thereof. This allows cloud computing environment 50 to offer infrastructure, platforms, and/or software as services for which a cloud consumer does not need to maintain resources on a local computing device. It is understood that the types of computing devices 54A-N shown in FIG. 6 are intended to be illustrative only and that computing nodes 10 and cloud computing environment 50 can communicate with any type of computerized device over any type of network and/or network addressable connection (e.g., using a web browser).

Referring now to FIG. 7, a set of functional abstraction layers provided by cloud computing environment 50 (FIG. 6) is shown. It should be understood in advance that the components, layers, and functions shown in FIG. 7 are intended to be illustrative only, and embodiments of the invention are not limited thereto. As depicted, the following layers and corresponding functions are provided:

Hardware and software layer 60 includes hardware and software components. Examples of hardware components include: mainframes 61; RISC (Reduced Instruction Set Computer) architecture based servers 62; servers 63; blade servers 64; storage devices 65; and networks and networking components 66. In some embodiments, software components include network application server software 67 and database software 68.

Virtualization layer 70 provides an abstraction layer from which the following examples of virtual entities may be provided: virtual servers 71; virtual storage 72; virtual networks 73, including virtual private networks; virtual applications and operating systems 74; and virtual clients 75.

In one example, management layer 80 may provide the functions described below. Resource provisioning 81 provides dynamic procurement of computing resources and other resources that are utilized to perform tasks within the cloud computing environment. Metering and Pricing 82 provide cost tracking as resources are utilized within the cloud computing environment, and billing or invoicing for consumption of these resources. In one example, these resources may include application software licenses. Security provides identity verification for cloud consumers and tasks, as well as protection for data and other resources. User portal 83 provides access to the cloud computing environment for consumers and system administrators. Service level management 84 provides cloud computing resource allocation and management such that required service levels are met. Service Level Agreement (SLA) planning and fulfillment 85 provide pre-arrangement for, and procurement of, cloud computing resources for which a future requirement is anticipated in accordance with an SLA.

Workloads layer 90 provides examples of functionality for which the cloud computing environment may be utilized. Examples of workloads and functions which may be provided from this layer include: mapping and navigation 91; software development and lifecycle management 92; virtual classroom education delivery 93; data analytics processing 94; transaction processing 95; and dataset management system 96. Dataset management system 96 may relate to managing datasets across storage tiers of a storage system, based on their relevance.

While the present invention has been described with reference to a limited number of embodiments, variants, and the accompanying drawings, it will be understood by those skilled in the art that various changes may be made and equivalents may be substituted without departing from the scope of the present invention. In particular, a feature (device-like or method-like) recited in a given embodiment or variant, or shown in a drawing, may be combined with or replace another feature in another embodiment, variant, or drawing, without departing from the scope of the present invention. Various combinations of the features described in respect of any of the above embodiments or variants may accordingly be contemplated, which remain within the scope of the appended claims. In addition, many minor modifications may be made to adapt a particular situation or material to the teachings of the present invention without departing from its scope. Therefore, it is intended that the present invention not be limited to the particular embodiments disclosed, but that the present invention will include all embodiments falling within the scope of the appended claims.

What is claimed is:
 1. A computer-implemented method for managing datasets in a storage system, wherein a subset of the datasets is labeled with respect to their relevance, so as to be associated with respective relevance values, the method comprising: defining supersets of datasets, based on metadata available for the datasets; associating each of the datasets to at least one of the defined supersets by comparing metadata available for each of the datasets with metadata used to define the at least one of the defined supersets; defining a heterogeneous bipartite factor graph having two types of nodes, wherein the datasets and the defined supersets are associated with a first type of nodes and a second type of nodes of the defined heterogeneous bipartite factor graph, respectively; associating each unlabeled dataset to at least one of the defined supersets by connecting each unlabeled dataset to at least one of the second type of nodes of the defined heterogeneous bipartite factor graph; determining, for each unlabeled dataset of the datasets, a respective probability distribution over a set of relevance values, to obtain a corresponding relevance value, wherein the respective probability distribution is computed by a message passing algorithm on the defined heterogeneous bipartite factor graph, whereby probability distributions are passed as messages along edges of the defined heterogeneous bipartite factor graph that connect pairs of nodes of different types in the defined heterogeneous bipartite factor graph; and managing the datasets in the storage system based on their associated relevance values, wherein managing the datasets comprises storing the datasets across storage tiers of the storage system based on their relevance values.